A Reinforcement Learning approach to study climbing plant behaviour

A plant’s structure is the result of constant adaptation and evolution to the surrounding environment. From this perspective, our goal is to investigate the mass and radius distribution of a particular plant organ, namely the searcher shoot, by providing a Reinforcement Learning (RL) environment, which we call Searcher-Shoot, that models the mechanics due to the mass of the shoot and leaves. We put forward the hypothesis that plants maximize their length while keeping stress below a maximal threshold. To test it, we explore whether the mass distribution along the stem is efficient, formulating the problem as a Markov Decision Process. By exploiting this strategy, we are able to mimic, and thus study, the plant’s behavior, finding that shoots decrease their diameter smoothly, which results in an efficient distribution of mass. The strong agreement between our results and experimental data underlines the strength of our approach for analysing the traits of biological systems.

task. Beyond these, four sub-elements characterize an RL system [1]: the policy, the reward, the value function, and the model. Their meaning is the following:
• The policy represents how the agent chooses an action based on the current state;
• The reward is the goal of the RL problem: it is a feedback signal defining the good and bad events for the agent;
• The value function is the total amount of reward an agent can expect to accumulate in the future, starting from a specific state; it helps the agent understand the long-term consequences of its actions;
• The model of the environment is an optional representation of the environment, which allows possible future situations to be planned before they are experienced.
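As a concrete illustration of these four sub-elements, the following minimal Python sketch represents each of them for a toy two-state problem; all names and numbers are invented for the example and are unrelated to the Searcher-Shoot environment.

```python
import random

# Toy two-state problem: each RL sub-element as a plain data structure.
states  = ["s0", "s1"]
actions = ["stay", "move"]

# Policy: probability of each action in each state.
policy = {s: {"stay": 0.5, "move": 0.5} for s in states}

# Reward: immediate feedback signal for a (state, action) pair.
reward = {("s0", "move"): 1.0, ("s0", "stay"): 0.0,
          ("s1", "move"): 0.0, ("s1", "stay"): 0.5}

# Value function: expected cumulative reward from each state (to be learned).
value = {s: 0.0 for s in states}

# Model (optional): predicted next state, usable for planning ahead.
model = {("s0", "move"): "s1", ("s0", "stay"): "s0",
         ("s1", "move"): "s0", ("s1", "stay"): "s1"}

def sample_action(state):
    """Draw an action from the policy's distribution for this state."""
    acts, probs = zip(*policy[state].items())
    return random.choices(acts, weights=probs, k=1)[0]

print(sample_action("s0"))
```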
We can express an RL problem using the mathematical formalism of the Markov Decision Process (MDP), used to study the control of sequential decisions that influence states and future rewards. An MDP is a tuple M = ⟨S, A, R, P, γ⟩ where S and A are the state and the action space, respectively; R is the reward function R : S × A → ℝ, representing the immediate reward; P is the transition function P : S × S × A → [0, 1], i.e., the probability of moving from one state to another after choosing an action; and γ is the discount factor, which balances the agent’s preference between immediate rewards (short-sighted agent) and future rewards (far-sighted agent).
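For instance, a minimal two-state, two-action MDP can be written out explicitly as such a tuple; all numerical values below are illustrative only.

```python
# Hypothetical 2-state, 2-action MDP written out as the tuple ⟨S, A, R, P, γ⟩.
S = ["s0", "s1"]
A = ["a0", "a1"]
gamma = 0.95  # discount factor

# Immediate reward R(s, a).
R = {("s0", "a0"): 0.0, ("s0", "a1"): 1.0,
     ("s1", "a0"): 0.5, ("s1", "a1"): 0.0}

# Transition probabilities P(s, s', a): probability of moving from s to s'
# after choosing action a.
P = {("s0", "s0", "a0"): 1.0, ("s0", "s1", "a0"): 0.0,
     ("s0", "s0", "a1"): 0.2, ("s0", "s1", "a1"): 0.8,
     ("s1", "s0", "a0"): 0.6, ("s1", "s1", "a0"): 0.4,
     ("s1", "s0", "a1"): 0.0, ("s1", "s1", "a1"): 1.0}

# Sanity check: each (s, a) pair defines a probability distribution over s'.
for s in S:
    for a in A:
        assert abs(sum(P[(s, s2, a)] for s2 in S) - 1.0) < 1e-9
```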
Briefly, at each time step t, in a state s_t ∈ S, the agent interacts with the environment and chooses an action a_t ∈ A, which leads to a reward r_{t+1} = R(s_t, a_t) and a transition to a new state s_{t+1} ∈ S. The probability of reaching the state s_{t+1} is given by P(s_t, s_{t+1}, a_t). The choice of the action a_t relies on the policy adopted by the agent. Formally, a policy is a function π : S × A → [0, 1] that gives the probability of choosing an action a ∈ A when the agent is in the state s ∈ S. The goal is to learn a policy that maximizes the total reward.
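This interaction loop can be sketched in a few lines of Python. Here we assume the Gymnasium API and a standard benchmark environment as a stand-in for a custom one, with a uniform random policy as a placeholder.

```python
import gymnasium as gym  # assumed dependency; any Gym-style environment works

# Generic agent-environment loop: in state s_t the agent samples a_t from its
# policy, receives r_{t+1}, and moves to s_{t+1}.
env = gym.make("CartPole-v1")   # placeholder environment
state, _ = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()        # a_t ~ π(· | s_t), here uniform
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward                    # accumulate r_{t+1}
    done = terminated or truncated

print(f"Episode return: {total_reward}")
```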

Proximal Policy Optimization
We can divide RL algorithms into two main categories: model-based, in which the system uses a predictive model of the world to choose the best action (i.e., the algorithm exploits the knowledge of a Markov Decision Process); and model-free, in which the agent learns a value function or a policy by interacting with the environment [2].
Having a model means relying on a function that predicts future states and rewards, allowing the agent to plan by thinking ahead and explicitly deciding between its options. Its absence means that the agent uses only the current state and its experience to learn.
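The distinction can be made concrete with two update rules (an illustrative sketch, not taken from the paper): a model-based backup averages over a known transition model P and reward R, whereas a model-free temporal-difference update uses only a single sampled transition.

```python
gamma, alpha = 0.95, 0.1   # discount factor and learning rate (illustrative)

def model_based_backup(V, s, S, A, P, R):
    """One-step Bellman backup: plan by averaging over the known model."""
    return max(sum(P[(s, s2, a)] * (R[(s, a)] + gamma * V[s2]) for s2 in S)
               for a in A)

def model_free_td_update(V, s, r, s_next):
    """TD(0) update: learn from a single experienced transition (s, r, s')."""
    return V[s] + alpha * (r + gamma * V[s_next] - V[s])
```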
In our work, we exploit the model-free approach in the form of the Proximal Policy Optimization (PPO) algorithm, introduced by Schulman et al. [3] in 2017. PPO is an on-policy, policy-gradient algorithm: it searches for an approximation of the best policy through a parameter θ, and each update of the policy π_θ relies on samples drawn from π_θ itself. There are two primary variants, PPO-Penalty and PPO-Clip; we use the latter.
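A minimal training sketch, assuming the stable-baselines3 implementation of PPO and the Gymnasium API (neither is stated here), with a standard benchmark environment standing in for the Searcher-Shoot environment:

```python
import gymnasium as gym
from stable_baselines3 import PPO   # assumed dependency

env = gym.make("CartPole-v1")       # placeholder for the custom environment
model = PPO("MlpPolicy", env, clip_range=0.2, gamma=0.99, verbose=1)
model.learn(total_timesteps=100_000)   # on-policy updates of the parameter θ

obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)
```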
In the PPO-Clip approach, the update from the parameter θ_k to θ_{k+1} relies on the maximisation of the following surrogate objective:

θ_{k+1} = arg max_θ E_{(s,a)∼π_{θ_k}} [ L(s, a, θ_k, θ) ],

where

L(s, a, θ_k, θ) = min( (π_θ(a|s) / π_{θ_k}(a|s)) A^{π_{θ_k}}(s, a), g(ϵ, A^{π_{θ_k}}(s, a)) ),
g(ϵ, A) = (1 + ϵ) A if A ≥ 0, (1 − ϵ) A if A < 0.

Here A^{π_{θ_k}} is the advantage function of the policy π_{θ_k}, and E_{(s,a)∼π_{θ_k}} denotes the average with respect to (s, a): the actions a are distributed according to the policy π_{θ_k} and the states s follow the stationary distribution of the Markov chain induced by π_{θ_k}. The proof of the convergence of this method is in [4]. This approach tries to increase the probability of taking the best action without moving too far from the current policy, thus avoiding a collapse of the learning process (trust-region approach [5]). Indeed, in L the hyperparameter ϵ represents how far the new policy is allowed to move from the old one. If A is positive, the chosen action is better than expected and becomes more likely to be chosen again; if A is negative, the chosen action becomes less likely. The minimum and the function g limit the policy change by constraining the probability ratio to stay within an interval of width 2ϵ around 1.
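The clipped term can be computed directly; the following sketch (assuming NumPy) evaluates the per-sample surrogate from log-probabilities under the new and old policies.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO-Clip surrogate L(s, a, θ_k, θ).

    ratio = π_θ(a|s) / π_{θ_k}(a|s); the clip keeps the ratio within
    [1 - eps, 1 + eps], so the update cannot move too far from π_{θ_k}.
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(ratio * advantage, clipped)

# Toy check (illustrative numbers): with a positive advantage and a ratio
# above 1 + eps, the surrogate is capped at (1 + eps) * A, as described above.
print(ppo_clip_objective(np.log(1.5), np.log(1.0), advantage=2.0))  # ≈ 2.4
```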

Derivation of the model
In this section, we describe in detail the derivation of the model. Some excellent guidelines for elastic rods and material mechanics can be found in [6, 7]. Consider a planar elastic rod Γ with directors {d_1, d_2, d_3}, subject to an external force F and an external moment L. The balance between the internal force n and the internal moment m with F and L is expressed by the following equations:

∂_s n + f = 0,   ∂_s m + ∂_s Γ × n + l = 0,   (1)

where f and l represent, respectively, the external force F and the external moment L per unit of length. In other words, we write f = ∂_s F and l = ∂_s L.
Since we are only considering the gravity force, we employ the plane coordinates {e_1, e_2} to recast the first equation of system (1) as

∂_s n(s) = g ρ_3(s) A(s) e_2,

where g is the gravity acceleration constant, ρ_3(s) is the volume density of the elastic rod, A(s) is the area of the cross-section of the rod at the point Γ(s), and gravity acts along −e_2. We assume that there are no internal forces acting at the tip of the rod, n(ℓ) = 0, so integrating from s to the tip at s = ℓ gives

n(s) = −g e_2 ∫_s^ℓ ρ_3(σ) A(σ) dσ.   (2)

The internal moment m per unit of length must be balanced with the moment per unit of length generated by the internal force n (second equation of system (1) with l = 0). This gives the relation

∂_s m(s) + ∂_s Γ(s) × n(s) = 0.   (3)

The combination of the Euler-Bernoulli equation for an elastic rod (Equation (1) in the main text), (2) and (3) gives the balance equation (2) in the main text.

Now, we want to prove the equation for the maximal stress (Equation (3) of the main text). We assume that the stress σ and the strain ε are proportional,

σ(s, z) = E ε(s, z),

where E is the Young's modulus. We recall that z represents the distance from the centreline along β(s) on the cross-section C(s). We also assume that the strain ε has the form ε(s, z) = α(s) z, where α(s) is a proportionality constant that may vary along the rod. Since the stress σ(s, z) is applied to the infinitesimal strip L(C(s, z)) dz, where L is the length (to be more precise, the Lebesgue measure) of C(s, z), the internal moment acting on the cross-section C(s) with respect to its centre is

m(s) = ∫ z σ(s, z) L(C(s, z)) dz = E α(s) ∫ z² L(C(s, z)) dz.

Consequently, the maximal stress is attained at the edge of the cross-section, at z = max{|y| : C(s, y) ≠ ∅}.
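As a numerical sanity check of this derivation, the following Python sketch discretizes a straight, horizontal, tapered rod loaded only by its own weight, sums the distal weight to approximate the internal force of Equation (2) and the corresponding bending moment, and evaluates the bending stress at the edge of each circular cross-section (σ_max = m z_max / I, as implied by the derivation above). Every parameter value is invented for the example and is not fitted plant data.

```python
import numpy as np

# Straight, horizontal, tapered circular rod under its own weight (illustrative).
L_rod, N = 0.5, 1000                       # rod length [m], grid points
s = np.linspace(0.0, L_rod, N)             # arc length, s = 0 at the base
ds = s[1] - s[0]
g, rho = 9.81, 1000.0                      # gravity [m/s^2], density [kg/m^3]
r = 5e-3 * (1.0 - 0.8 * s / L_rod)         # radius tapering from 5 mm to 1 mm
A = np.pi * r**2                           # cross-section area A(s)
I = np.pi * r**4 / 4.0                     # second moment of area of a disc

# Equation (2): the internal force at s balances the weight of the rod distal
# to s (free tip, n(L_rod) = 0); a discrete tail sum stands in for the integral.
w = rho * g * A                            # weight per unit length
n = np.cumsum((w * ds)[::-1])[::-1]        # ≈ ∫_s^ℓ ρ_3 A g dσ (magnitude)

# Bending moment of the distal weight about each s, with lever arm (σ - s).
m = np.array([np.sum(w[i:] * (s[i:] - s[i]) * ds) for i in range(N)])

# Maximal bending stress at the edge of the cross-section, z_max = r(s).
sigma_max = m * r / I
print(f"base load {n[0]:.3f} N, peak stress {sigma_max.max():.3e} Pa "
      f"at s = {s[sigma_max.argmax()]:.3f} m")
```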